Pure High-Order Word Dependence Mining via Information Geometry

نویسندگان

  • Yuexian Hou
  • Liang He
  • Xiaozhao Zhao
  • Dawei Song
چکیده

The classical bag-of-word models fail to capture contextual associations between words. We propose to investigate the “high-order pure dependence” among a number of words forming a semantic entity, i.e., the high-order dependence that cannot be reduced to the random coincidence of lower-order dependence. We believe that identifying these high-order pure dependence patterns will lead to a better representation of documents. We first present two formal definitions of pure dependence: Unconditional Pure Dependence (UPD) and Conditional Pure Dependence (CPD). The decision on UPD or CPD, however, is a NP-hard problem. We hence prove a series of sufficient criteria that entail UPD and CPD, within the well-principled Information Geometry (IG) framework, leading to a more feasible UPD/CPD identification procedure. We further develop novel methods to extract word patterns with high-order pure dependence, which can then be used to extend the original unigram document models. Our methods are evaluated in the context of query expansion. Compared with the original unigram model and its extensions with term associations derived from constant n-grams and Apriori association rule mining, our IG-based methods have proved mathematically more rigorous and empirically more effective.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Information Measures via Copula Functions

In applications of differential geometry to problems of parametric inference, the notion of divergence is often used to measure the separation between two parametric densities. Among them, in this paper, we will verify measures such as Kullback-Leibler information, J-divergence, Hellinger distance, -Divergence, … and so on. Properties and results related to distance between probability d...

متن کامل

Characterizing Pure High-Order Entanglements in Lexical Semantic Spaces via Information Geometry

An emerging topic in Quantuam Interaction is the use of lexical semantic spaces, as Hilbert spaces, to capture the meaning of words. There has been some initial evidence that the phenomenon of quantum entanglement exists in a semantic space and can potentially play a crucial role in determining the embeded semantics. In this paper, we propose to consider pure high-order entanglements that canno...

متن کامل

Mining the Web for Sense Discrimination Patterns

In this paper we present a method for mining the Web in order to extract lexical patterns that help in discriminating the senses of a given polysemic word. These patters are defined as sets and sequences of words strongly related to each sense of the word. To discover the patterns, the method first determines the different senses of the word from a reference lexical database, and then it uses t...

متن کامل

Perform Three Data Mining Tasks with Crowdsourcing Process

For data mining studies, because of the complexity of doing feature selection process in tasks by hand, we need to send some of labeling to the workers with crowdsourcing activities. The process of outsourcing data mining tasks to users is often handled by software systems without enough knowledge of the age or geography of the users' residence. Uncertainty about the performance of virtual user...

متن کامل

A Geometry Preserving Kernel over Riemannian Manifolds

Abstract- Kernel trick and projection to tangent spaces are two choices for linearizing the data points lying on Riemannian manifolds. These approaches are used to provide the prerequisites for applying standard machine learning methods on Riemannian manifolds. Classical kernels implicitly project data to high dimensional feature space without considering the intrinsic geometry of data points. ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2011